The following libraries will be used for this assignment.
suppressWarnings(library(ggplot2))
library(plotly)
library(dbscan)
library(dplyr)
library(class)
library(ROCR)
This assignment involves a forgery detection problem. Specifically, we will work with the dataset Stamps, originally reported in Micenková, van Beusekom, and Shafait (2015) and available from the outlier data repository described in Campos et al. (2016). In particular, we use the original version of this dataset (not normalised, without duplicates), which contains 340 observations described by 9 variables (numerical predictors). Each observation (row) is a feature-vector description of a stamp, with 9 features (columns 1 to 9). Originally this is a binary classification dataset, containing forged (photocopied and printed) stamps as well as genuine (ink) stamps. The last (10th) column of the dataset contains the class labels ('yes' denotes forged, 'no' denotes genuine).
There are 309 genuine and 31 forged stamps. We will try to identify the forged stamps (9.12%, presumably outliers) among the genuine stamps (90.88%, presumably inliers), both in an unsupervised and in a supervised way. Class labels will be used for supervised learning only, in Activity 3 (supervised anomaly detection) of this assignment. In Activity 1 (PCA) and Activity 2 (unsupervised outlier detection), which are unsupervised, class labels will only be used for visualisation and external assessment of the results; learning will make use of the 9 numerical features only. The features are based on colour and printing properties of the stamps. The following code reads the data into memory and separates the 9 predictors ( PB_Predictors ) from the class labels ( PB_class ):
stamps <- read.table("Stamps_withoutdupl_09.csv", header=FALSE, sep=",", dec=".")
summary(stamps) # 9 Predictors (V1 to V9) and class labels (V10)
## V1 V2 V3 V4
## Min. :0.00000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.04883 1st Qu.:0.3357 1st Qu.:0.04662 1st Qu.:0.2369
## Median :0.07872 Median :0.4252 Median :0.09584 Median :0.3456
## Mean :0.10265 Mean :0.4197 Mean :0.14193 Mean :0.3841
## 3rd Qu.:0.12214 3rd Qu.:0.4416 3rd Qu.:0.19458 3rd Qu.:0.5233
## Max. :1.00000 Max. :0.8645 Max. :1.00000 Max. :1.0000
## V5 V6 V7 V8
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.000656
## 1st Qu.:0.4524 1st Qu.:0.01010 1st Qu.:0.9183 1st Qu.:0.018221
## Median :0.5783 Median :0.01945 Median :0.9724 Median :0.032162
## Mean :0.5986 Mean :0.04151 Mean :0.9299 Mean :0.054981
## 3rd Qu.:0.7373 3rd Qu.:0.04528 3rd Qu.:0.9884 3rd Qu.:0.059360
## Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.000000
## V9 V10
## Min. :0.0000 no :309
## 1st Qu.:0.4492 yes: 31
## Median :0.5787
## Mean :0.5693
## 3rd Qu.:0.7247
## Max. :1.0000
PB_Predictors <- stamps[,1:9] # 9 Predictors (V1 to V9)
PB_class <- stamps[,10] # Class labels (V10)
PB_class <- ifelse(PB_class == 'no',0,1) # Inliers (class "no") = 0, Outliers (class "yes") = 1
1. Perform Principal Component Analysis (PCA) on the Stamps data in the 9-dimensional space of the numerical predictors ( PB_Predictors ), and show the Proportion of Variance Explained (PVE) for each of the nine resulting principal components. Plot the accumulated sum of PVE for the first k components, as a function of k, and discuss the result: (a) How many components do we need to explain 90% or more of the total variance? (b) How much of the total variance is explained by the first three components?
PCA <- prcomp(PB_Predictors, scale = TRUE)
PCA$rotation
## PC1 PC2 PC3 PC4 PC5 PC6
## V1 -0.3570003 -0.10417655 0.1813733 -0.01930228 0.68124006 0.44187577
## V2 0.1420765 -0.30721615 0.1299597 0.92287473 0.06310383 -0.10282233
## V3 -0.3317113 -0.04657823 -0.4767126 0.17893590 -0.46295498 0.39982667
## V4 -0.4230165 -0.22831476 -0.3981637 0.02135029 -0.06765320 -0.08319882
## V5 -0.3853972 -0.33542901 -0.1876273 -0.07414299 0.26797160 -0.51370341
## V6 -0.3966169 0.39119428 0.1187845 0.11471184 0.04270746 -0.39846576
## V7 0.2747583 -0.64007315 -0.1336053 -0.23489388 0.10043151 0.09738610
## V8 0.1884198 0.40303353 -0.5625883 0.19360251 0.45311956 0.20043999
## V9 0.3828311 0.05284966 -0.4251531 -0.06436384 0.15488411 -0.39378111
## PC7 PC8 PC9
## V1 -0.39311729 0.109611082 0.05031087
## V2 -0.03739759 -0.006920782 0.02905519
## V3 -0.31137242 0.026120489 -0.39585520
## V4 0.12867762 -0.063683586 0.76014917
## V5 0.26148066 0.279863425 -0.46460688
## V6 -0.26188791 -0.651400166 -0.09427161
## V7 -0.04026264 -0.635461942 -0.12955785
## V8 0.40078359 -0.181465084 -0.09499376
## V9 -0.65676748 0.209314272 0.10742798
# PVE - Proportion of the Variance Explained (PVE)
PVE <- (PCA$sdev^2)/sum(PCA$sdev^2)
# plot PVE
PVE.df <- data.frame(PVE)
PVE.df <- cbind(PVE.df, c("PC1","PC2","PC3","PC4","PC5","PC6","PC7","PC8","PC9"))
colnames(PVE.df) <- c("PVE", "PC")
PVE.df$PC <- as.factor(PVE.df$PC)
PVE.plot <- ggplot(data=PVE.df, aes(x=PC, y= PVE)) +
geom_bar(stat="identity", fill = "#375094") +
labs(title = "Proportion of the Variance Explained (PVE)", x = "Principal Component", y = "PVE") +
geom_text(aes(label=sprintf("%0.3f", round(PVE.df$PVE, digits = 4))), position=position_dodge(width=0.9), vjust=-0.25) +
theme_minimal()
PVE.plot
# Plot Cumulative sum of PVE
plot(cumsum(PVE.df$PVE), main = "Cumulative Sum of PVE",
type = "b", ylab = "Cumulative Sum of PVE",
xlab = "Number of Principal Components")
(a) How many components do we need to explain 90% or more of the total variance?
Six components are needed to explain 90% or more of the total variance.
# PC PVE value
PC1 <- 0.399
PC2 <- 0.152
PC3 <- 0.138
PC4 <- 0.100
PC5 <- 0.078
PC6 <- 0.063
# Sum
PC1+PC2+PC3+PC4+PC5+PC6
## [1] 0.93
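Rather than summing hand-copied values, the same check can be done programmatically from the PVE vector with `cumsum()`. A minimal sketch on synthetic data (the Stamps features are not reproduced here; applied to the `PVE` object computed above, the same one-liner reproduces the answer of six components):

```r
# Sketch: number of components needed to reach 90% cumulative PVE,
# computed directly with cumsum() instead of hand-copied values.
# A synthetic matrix stands in for PB_Predictors.
set.seed(1)
X <- matrix(rnorm(340 * 9), ncol = 9)
pca <- prcomp(X, scale. = TRUE)
pve <- pca$sdev^2 / sum(pca$sdev^2)
n_comp <- which(cumsum(pve) >= 0.9)[1]  # first k with cumulative PVE >= 0.9
n_comp
```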
(b) How much of the total variance is explained by the first three components?
# sum first 3 PC
PC1+PC2+PC3
## [1] 0.689
2.Do some research by yourself on how to render 3D plots in R, and then plot a 3D scatter-plot of the Stamps data as represented by the first three principal components computed in the previous item ( x = PC1 , y = PC2 , and z = PC3 ). You can use, for example, the function scatter3D() from the package plot3D . Use the class labels ( PB_class ) to plot inliers and outliers in different colours (for example, inliers in black and outliers in red). Make sure you produce multiple plots from different angles (at least three). Recalling that the class labels would not be available in a practical application of unsupervised outlier detection, do the outliers (forged stamps) look easy to detect in an unsupervised way, assuming that the 3D visualisation of the data via PCA is a reasonable representation of the data in full space? How about in a supervised way? Why? Justify your answers.
# data manipulation for 3D plotting
PCA.subset <- cbind(PCA$x[,1], PCA$x[,2], PCA$x[,3])
PCA.subset <- data.frame(PCA.subset)
colnames(PCA.subset) <- c("PC1", "PC2", "PC3")
PB_class[which(PB_class == 0)] <- "Inlier"
PB_class[which(PB_class == 1)] <- "Outlier"
PB_class <- as.factor(PB_class)
# set scene for different camera angles
scene1 = list(camera = list(eye = list(x = 1.45, y = 1.45, z = 1.45)))
scene2 = list(camera = list(eye = list(x = -0.75, y = 2.40, z = 0.30)))
scene3 = list(camera = list(eye = list(x = -2.40, y = 0, z = 0.75)))
scene4 = list(camera = list(eye = list(x = 0, y = -2.40, z = 0.75)))
# intentionally do not use a loop so that the plots can be displayed in different tabs in the RMD file
# 3D plot 1
plot.3d.stamps1 <- plot_ly(PCA.subset, x = ~PC1, y = ~PC2, z = ~PC3, color = PB_class, colors = c('black', 'red'),
marker = list(size = 4)) %>%
add_markers() %>%
layout(scene = scene1,
title = "Angle 1")
# 3D plot 2
plot.3d.stamps2 <- plot_ly(PCA.subset, x = ~PC1, y = ~PC2, z = ~PC3, color = PB_class, colors = c('black', 'red'),
marker = list(size = 4)) %>%
add_markers() %>%
layout(scene = scene2,
title = "Angle 2")
# 3D plot 3
plot.3d.stamps3 <- plot_ly(PCA.subset, x = ~PC1, y = ~PC2, z = ~PC3, color = PB_class, colors = c('black', 'red'),
marker = list(size = 4)) %>%
add_markers() %>%
layout(scene = scene3,
title = "Angle 3")
# 3D plot 4
plot.3d.stamps4 <- plot_ly(PCA.subset, x = ~PC1, y = ~PC2, z = ~PC3, color = PB_class, colors = c('black', 'red'),
marker = list(size = 4)) %>%
add_markers() %>%
layout(scene = scene4,
title = "Angle 4")
plot.3d.stamps1
plot.3d.stamps2
plot.3d.stamps3
plot.3d.stamps4
Do the outliers (forged stamps) look easy to detect in an unsupervised way?
To identify these observations as outliers without the given class labels, more detailed analysis would be needed to examine the "outlyingness" score of each observation. Looking solely at the 3D plots, however, the outliers coloured in red mostly sit apart from the main cluster of the principal-component projection, so some level of "outlyingness" can be inferred for these observations. That said, a number of observations far from the main cluster are not labelled as outliers in the original class labels, so I would not be confident in declaring outliers based solely on the PCA representation. On the other hand, no labelled outliers appear inside the main cluster of observations, which suggests the PCA representation does separate the outliers to some degree.
How about in a supervised way?
Looking at the class labels in the 3D plot, I believe the data could be used for supervised learning. With the class labels it is clear that the red observations are not part of the main cluster of data. There are observations far from the cluster that are marked as inliers, but a decent number of the outliers are grouped together. Depending on the classification algorithm used and the sensitivity requirements around classification, thresholds could be tuned (e.g. in logistic regression) to help classify the other observations that also appear to be outliers in the 3D plots.
In this second activity, you are asked to perform unsupervised outlier detection on the Stamps data in the 9-dimensional space of the numerical predictors ( PB_Predictors ), using KNN Outlier with different values of the parameter k (at least the following three: k = 5, 25, 100). For each k, produce the same 3D PCA visualisation of the data as in Activity 1 (PCA), but rather than using the class labels to colour the points, use instead the resulting KNN Outlier Scores as a continuous, diverging colour scale. Then, for each k, produce a second plot where the top-31 outliers according to the KNN Outlier Scores are shown in red, while the other points are shown in black. Do these plots give you any insights on the values of k that look more or less appropriate from an unsupervised perspective (ignoring the class labels)? Justify your answer.
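Before the dbscan-based code below, here is a minimal base-R sketch of what the KNN outlier score is: the distance from each point to its k-th nearest neighbour, which is what `kNNdist()` computes efficiently. The toy matrix and seed are illustrative, not the Stamps data.

```r
# Sketch: KNN outlier score = Euclidean distance to the k-th nearest
# neighbour, computed naively with base R on toy data.
set.seed(42)
knn_score <- function(X, k) {
  D <- as.matrix(dist(X))                   # pairwise Euclidean distances
  apply(D, 1, function(d) sort(d)[k + 1])   # skip self-distance (0) at index 1
}
X <- rbind(matrix(rnorm(20), ncol = 2),     # 10 points near the origin
           c(10, 10))                       # 1 far-away point
scores <- knn_score(X, k = 3)
which.max(scores)  # the far-away point (row 11) gets the largest score
```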
# Activity 2
# intentionally do not use a loop so that the plots can be displayed in different tabs in the RMD file
# set k
k <- c(5,25,100)
# unsupervised outlier detection using knndist()
# k = 5
KNN_Outlier.k5 <- kNNdist(x=PB_Predictors, k = k[1])[,k[1]] # KNN distance (outlier score) computation
# sort & display top 31 based on outlier scores
top_n <- 31 # No. of top outliers to be displayed
rank_KNN_Outlier.k5 <- order(x=KNN_Outlier.k5, decreasing = TRUE) # Sorting (descending)
KNN_Result.k5 <- data.frame(ID = rank_KNN_Outlier.k5, score = KNN_Outlier.k5[rank_KNN_Outlier.k5])
head(KNN_Result.k5, top_n)
## ID score
## 1 150 1.0831392
## 2 271 0.9452268
## 3 22 0.8701538
## 4 2 0.8176631
## 5 130 0.7382603
## 6 328 0.6434083
## 7 19 0.6400374
## 8 199 0.4853853
## 9 197 0.4831582
## 10 88 0.4626465
## 11 116 0.4620797
## 12 191 0.4619196
## 13 276 0.4239779
## 14 49 0.4228653
## 15 186 0.4188267
## 16 273 0.4150218
## 17 188 0.3989988
## 18 102 0.3716263
## 19 149 0.3706868
## 20 238 0.3685038
## 21 297 0.3653524
## 22 3 0.3650678
## 23 169 0.3559213
## 24 267 0.3557550
## 25 12 0.3552236
## 26 4 0.3532579
## 27 263 0.3448771
## 28 136 0.3406615
## 29 171 0.3377005
## 30 23 0.3282753
## 31 178 0.3199439
# k = 25
KNN_Outlier.k25 <- kNNdist(x=PB_Predictors, k = k[2])[,k[2]] # KNN distance (outlier score) computation
# sort & display top 31 based on outlier scores
rank_KNN_Outlier.k25 <- order(x=KNN_Outlier.k25, decreasing = TRUE) # Sorting (descending)
KNN_Result.k25 <- data.frame(ID = rank_KNN_Outlier.k25, score = KNN_Outlier.k25[rank_KNN_Outlier.k25])
head(KNN_Result.k25, top_n)
## ID score
## 1 150 1.2963491
## 2 271 1.0615617
## 3 22 1.0096438
## 4 2 0.9969271
## 5 328 0.9158535
## 6 130 0.8550355
## 7 19 0.7513259
## 8 199 0.5966591
## 9 276 0.5925294
## 10 197 0.5920398
## 11 88 0.5896104
## 12 171 0.5894291
## 13 238 0.5829931
## 14 169 0.5788086
## 15 116 0.5686392
## 16 179 0.5632258
## 17 273 0.5522833
## 18 198 0.5520698
## 19 334 0.5477680
## 20 24 0.5399514
## 21 25 0.5319735
## 22 191 0.5314182
## 23 263 0.5119748
## 24 267 0.5098777
## 25 49 0.5092998
## 26 186 0.4998188
## 27 26 0.4942470
## 28 149 0.4910287
## 29 16 0.4909396
## 30 188 0.4832703
## 31 316 0.4763570
# k = 100
KNN_Outlier.k100 <- kNNdist(x=PB_Predictors, k = k[3])[,k[3]] # KNN distance (outlier score) computation
# sort & display top 31 based on outlier scores
rank_KNN_Outlier.k100 <- order(x=KNN_Outlier.k100, decreasing = TRUE) # Sorting (descending)
KNN_Result.k100 <- data.frame(ID = rank_KNN_Outlier.k100, score = KNN_Outlier.k100[rank_KNN_Outlier.k100])
head(KNN_Result.k100, top_n)
## ID score
## 1 150 1.5027937
## 2 271 1.1867804
## 3 22 1.1317200
## 4 2 1.1002580
## 5 328 1.0519702
## 6 130 1.0366054
## 7 19 0.8689258
## 8 238 0.8155971
## 9 169 0.8117624
## 10 171 0.8028156
## 11 88 0.7894132
## 12 334 0.7701768
## 13 179 0.7664834
## 14 16 0.7528911
## 15 199 0.7516090
## 16 116 0.7475236
## 17 198 0.7337230
## 18 276 0.7317956
## 19 149 0.7125346
## 20 25 0.7076002
## 21 24 0.7062274
## 22 197 0.7046969
## 23 316 0.6814831
## 24 14 0.6769145
## 25 18 0.6761360
## 26 12 0.6750803
## 27 273 0.6735271
## 28 26 0.6709476
## 29 267 0.6708538
## 30 191 0.6583719
## 31 49 0.6572741
# Plot PCA - K = 5
plot.3d.knndist.k5 <- plot_ly(PCA.subset, x = ~PC1, y = ~PC2, z = ~PC3, color = KNN_Outlier.k5,
marker = list(size = 4)) %>%
layout(title = "PCA coloured by KNN Outlier score - k = 5")
# Plot PCA - K = 25
plot.3d.knndist.k25 <- plot_ly(PCA.subset, x = ~PC1, y = ~PC2, z = ~PC3, color = KNN_Outlier.k25,
marker = list(size = 4)) %>%
layout(title = "PCA coloured by KNN Outlier score - k = 25")
# Plot PCA - K = 100
plot.3d.knndist.k100 <- plot_ly(PCA.subset, x = ~PC1, y = ~PC2, z = ~PC3, color = KNN_Outlier.k100,
marker = list(size = 4)) %>%
layout(title = "PCA coloured by KNN Outlier score - k = 100")
# PCA with Top31 Outlier scores colour coded black & red
# create top 31 results for each k
k5.t31 <- head(KNN_Result.k5, top_n)
k25.t31 <- head(KNN_Result.k25, top_n)
k100.t31 <- head(KNN_Result.k100, top_n)
# bind together PCA data and KNN data for plotting for each k
PCA.k5 <- cbind(PCA.subset, KNN_Result.k5)
PCA.k25 <- cbind(PCA.subset, KNN_Result.k25)
PCA.k100 <- cbind(PCA.subset, KNN_Result.k100)
# left outer join/merge PCA & KNN results with top31 results to create a dataframe(s) that can be used for plotting
PCA.k5.t31 <- merge(PCA.k5, k5.t31 , by = "ID", all.x = TRUE)
PCA.k25.t31 <- merge(PCA.k25, k25.t31 , by = "ID", all.x = TRUE)
PCA.k100.t31 <- merge(PCA.k100, k100.t31 , by = "ID", all.x = TRUE)
# add column for colouring of plots
# k5
PCA.k5.t31 <- PCA.k5.t31 %>%
mutate(class = case_when(is.na(score.y) ~ "PCA Obs",
score.y > 0 ~ "Top 31 KNN Outlier"))
# k25
PCA.k25.t31 <- PCA.k25.t31 %>%
mutate(class = case_when(is.na(score.y) ~ "PCA Obs",
score.y > 0 ~ "Top 31 KNN Outlier"))
# k100
PCA.k100.t31 <- PCA.k100.t31 %>%
mutate(class = case_when(is.na(score.y) ~ "PCA Obs",
score.y > 0 ~ "Top 31 KNN Outlier"))
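As a side note, the merge-then-case_when flagging above can also be done with a direct membership test on the top-n IDs. A minimal sketch on toy scores (the score values here are illustrative, only the class-label strings match the plots below):

```r
# Sketch: flag the top-n scoring points without a merge, using %in%.
scores <- c(0.9, 0.1, 0.5, 0.8, 0.2)  # toy outlier scores
top_n <- 2
top_ids <- order(scores, decreasing = TRUE)[1:top_n]  # IDs of the top scores
cls <- ifelse(seq_along(scores) %in% top_ids, "Top 31 KNN Outlier", "PCA Obs")
cls
```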
# Plot the top 31 KNN Outlier's against original PCA data
# k5
plot.3d.knndist.k5.t31 <- plot_ly(PCA.subset, x = ~PC1, y = ~PC2, z = ~PC3, color = ~PCA.k5.t31$class,
text = rownames(PCA.k5.t31),
colors = c("black", "red"),
marker = list(size = 4)) %>%
layout(title = "PCA with Top 31 KNN Outliers Ranked on Score - k = 5")
# k25
plot.3d.knndist.k25.t31 <- plot_ly(PCA.subset, x = ~PC1, y = ~PC2, z = ~PC3, color = ~PCA.k25.t31$class,
text = rownames(PCA.k25.t31),
colors = c("black", "red"),
marker = list(size = 4)) %>%
layout(title = "PCA with Top 31 KNN Outliers Ranked on Score - k = 25")
# k100
plot.3d.knndist.k100.t31 <- plot_ly(PCA.subset, x = ~PC1, y = ~PC2, z = ~PC3, color = ~PCA.k100.t31$class,
text = rownames(PCA.k100.t31),
colors = c("black", "red"),
marker = list(size = 4)) %>%
layout(title = "PCA with Top 31 KNN Outliers Ranked on Score - k = 100")
# Plots
plot.3d.knndist.k5
## No trace type specified:
## Based on info supplied, a 'scatter3d' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#scatter3d
## No scatter3d mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
plot.3d.knndist.k25
plot.3d.knndist.k100
plot.3d.knndist.k5.t31
plot.3d.knndist.k25.t31
plot.3d.knndist.k100.t31
Do these plots give you any insights on the values of k that look more or less appropriate from an unsupervised perspective (ignoring the class labels)?
Looking at the three plots with the continuous colour scale, it is clear that only a small number of observations have a high outlier score: few points take the yellow end of the scale in any of the three plots. A large cluster of dark-blue points is visible, and as observations move away from that cluster their outlier scores increase, shown by the colour shift along the scale. The choice of k does not seem to have much impact on the clustering or outlier detection, as the plots are very similar. The scores themselves grow slightly as k increases to 100, but the plots are not greatly affected.
Analysing the Top-31 plots, the conclusion is similar: the different k values barely change the plots at all. If I had to use this model in an unsupervised way, I would keep k = 5, or potentially do some more analysis to see whether k could be reduced even further, given the risk of overfitting the model.
1. Perform supervised classification of the Stamps data, using a KNN classifier with the same values of k as used in Activity 2 (unsupervised outlier detection). For each classifier (that is, each value of k), compute the Area Under the ROC Curve (AUC-ROC) in a Leave-One-Out Cross-Validation (LOOCV) scheme.
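`class::knn.cv()` performs the LOOCV internally. As a minimal base-R illustration of the scheme for the special case of a 1-NN classifier (toy data, not the Stamps features):

```r
# Sketch: leave-one-out cross-validation for 1-NN in base R.
# Each point is classified by the label of its nearest *other* point.
set.seed(7)
X <- matrix(rnorm(20), ncol = 2)
y <- rep(c("no", "yes"), each = 5)
loocv_1nn <- function(X, y) {
  D <- as.matrix(dist(X))
  diag(D) <- Inf                 # never let a point be its own neighbour
  y[apply(D, 1, which.min)]      # label of the nearest remaining point
}
pred <- loocv_1nn(X, y)
mean(pred == y)                  # LOOCV accuracy on the toy data
```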
# 1
# Function to create ROC plot
rocplot <- function(pred, truth){
predobj <- prediction(pred, truth)
ROC <- performance(predobj, "tpr", "fpr")
# Plot the ROC Curve
plot(ROC)
auc <- performance(predobj, measure = "auc")
auc <- auc@y.values[[1]]
# Return the Area Under the Curve ROC
return(auc)
}
# separate class labels from predictors
predictors <- stamps[c("V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9")]
class_labels <- stamps[, "V10"]
# Loop and plot through each k value of 5, 25, 100
for (i in 1:3){
Pred_class <- knn.cv(train=predictors, cl=class_labels, k=k[i], prob = TRUE)
Pred_prob <- attr(Pred_class, "prob")
Pred_prob <- ifelse(Pred_class=='yes', Pred_prob, 1 - Pred_prob)
AUC <- rocplot(pred=Pred_prob, truth=class_labels)
abline(a=0, b= 1)
text(x = .40, y = .6,paste("AUC = ", round(AUC[[1]],3), sep = ""))
}
2. Compare the resulting (supervised) KNN classification performance for each value of k against the classification performance obtained in an unsupervised way by the KNN Outlier method with the same value of k. Notice that, if we rescale the KNN Outlier Scores (obtained in Activity 2 (unsupervised outlier detection)) into the [0, 1] interval, these scores can be interpreted as outlier probabilities, which can then be compared with the class labels (ground truth) in PB_class to compute an AUC-ROC value. This way, for each value of k, the AUC-ROC of the supervised KNN classifier can be compared with the AUC-ROC of KNN Outlier as an unsupervised classifier. Compare the performances of the supervised versus unsupervised classifiers and discuss the results. For example, recalling that the supervised method makes use of the class labels, whereas the unsupervised method doesn't, what can you conclude considering there are applications where class labels are not available?
# 2
kNN_scale.k5 <- (KNN_Outlier.k5 - min(KNN_Outlier.k5)) / (max(KNN_Outlier.k5) - min(KNN_Outlier.k5))
kNN_scale.k25 <- (KNN_Outlier.k25 - min(KNN_Outlier.k25)) / (max(KNN_Outlier.k25) - min(KNN_Outlier.k25))
kNN_scale.k100 <- (KNN_Outlier.k100 - min(KNN_Outlier.k100)) / (max(KNN_Outlier.k100) - min(KNN_Outlier.k100))
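The min-max rescaling above maps the scores into [0, 1] but cannot change the resulting AUC-ROC, since AUC depends only on the ranking of the scores. A base-R sketch using the rank (Mann-Whitney) formula on toy scores and labels:

```r
# Sketch: AUC via the Mann-Whitney rank formula; a monotone rescaling
# (such as min-max) leaves it unchanged, since only ranks matter.
auc_rank <- function(scores, labels) {
  r <- rank(scores)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
s <- c(0.2, 0.9, 0.4, 0.8, 0.1)   # toy outlier scores
lab <- c(0, 1, 0, 1, 0)           # toy ground-truth labels
s_scaled <- (s - min(s)) / (max(s) - min(s))
auc_rank(s, lab) == auc_rank(s_scaled, lab)  # TRUE: AUC is rank-based
```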
# Compare supervised against unsupervised and plot k = 5
# recompute the LOOCV predictions for k = 5: Pred_prob from the loop above
# only holds the last iteration (k = 100)
Pred_class.k5 <- knn.cv(train=predictors, cl=class_labels, k=k[1], prob = TRUE)
Pred_prob.k5 <- attr(Pred_class.k5, "prob")
Pred_prob.k5 <- ifelse(Pred_class.k5=='yes', Pred_prob.k5, 1 - Pred_prob.k5)
# create 2 prediction objects for comparison: supervised (knn.cv) & unsupervised (outlier score)
supervised.pred.k5 <- prediction(Pred_prob.k5, class_labels)
unsupervised.pred.k5 <- prediction(kNN_scale.k5, class_labels)
# create 2 performance objects for comparison
supervised.perf.k5 <- performance(supervised.pred.k5, "tpr", "fpr")
unsupervised.perf.k5 <- performance(unsupervised.pred.k5, "tpr", "fpr")
# get auc for K = 5 (outlier score)
auc.k5 <- performance(unsupervised.pred.k5, measure = "auc")
auc.k5 <- auc.k5@y.values[[1]]
# Compare supervised against unsupervised and plot k = 25
# recompute the LOOCV predictions for k = 25
Pred_class.k25 <- knn.cv(train=predictors, cl=class_labels, k=k[2], prob = TRUE)
Pred_prob.k25 <- attr(Pred_class.k25, "prob")
Pred_prob.k25 <- ifelse(Pred_class.k25=='yes', Pred_prob.k25, 1 - Pred_prob.k25)
# create 2 prediction objects for comparison: supervised (knn.cv) & unsupervised (outlier score)
supervised.pred.k25 <- prediction(Pred_prob.k25, class_labels)
unsupervised.pred.k25 <- prediction(kNN_scale.k25, class_labels)
# create 2 performance objects for comparison
supervised.perf.k25 <- performance(supervised.pred.k25, "tpr", "fpr")
unsupervised.perf.k25 <- performance(unsupervised.pred.k25, "tpr", "fpr")
# get auc for K = 25 (outlier score)
auc.k25 <- performance(unsupervised.pred.k25, measure = "auc")
auc.k25 <- auc.k25@y.values[[1]]
# Compare supervised against unsupervised and plot k = 100
# recompute the LOOCV predictions for k = 100
Pred_class.k100 <- knn.cv(train=predictors, cl=class_labels, k=k[3], prob = TRUE)
Pred_prob.k100 <- attr(Pred_class.k100, "prob")
Pred_prob.k100 <- ifelse(Pred_class.k100=='yes', Pred_prob.k100, 1 - Pred_prob.k100)
# create 2 prediction objects for comparison: supervised (knn.cv) & unsupervised (outlier score)
supervised.pred.k100 <- prediction(Pred_prob.k100, class_labels)
unsupervised.pred.k100 <- prediction(kNN_scale.k100, class_labels)
# create 2 performance objects for comparison
supervised.perf.k100 <- performance(supervised.pred.k100, "tpr", "fpr")
unsupervised.perf.k100 <- performance(unsupervised.pred.k100, "tpr", "fpr")
# get auc for K = 100 (outlier score)
auc.k100 <- performance(unsupervised.pred.k100, measure = "auc")
auc.k100 <- auc.k100@y.values[[1]]
# plot for comparison (supervised AUC computed here rather than hard-coded)
auc.sup.k5 <- performance(supervised.pred.k5, measure = "auc")@y.values[[1]]
plot(supervised.perf.k5, col = "dark blue", main = "Supervised vs Unsupervised k = 5")
plot(unsupervised.perf.k5, add = TRUE, col = "dark green")
legend("topleft", legend = c("Supervised", "Unsupervised"),
col = c("dark blue", "dark green"), lty = 1, cex = 0.8)
abline(a=0, b=1)
text(x = .80, y = .27, paste("AUC Supervised = ", round(auc.sup.k5, 3), sep = ""), col = "dark blue")
text(x = .79, y = .21, paste("AUC Unsupervised = ", round(auc.k5, 3), sep = ""), col = "dark green")
# plot for comparison (supervised AUC computed here rather than hard-coded)
auc.sup.k25 <- performance(supervised.pred.k25, measure = "auc")@y.values[[1]]
plot(supervised.perf.k25, col = "dark blue", main = "Supervised vs Unsupervised k = 25")
plot(unsupervised.perf.k25, add = TRUE, col = "dark green")
legend("topleft", legend = c("Supervised", "Unsupervised"),
col = c("dark blue", "dark green"), lty = 1, cex = 0.8)
abline(a=0, b=1)
text(x = .80, y = .27, paste("AUC Supervised = ", round(auc.sup.k25, 3), sep = ""), col = "dark blue")
text(x = .79, y = .21, paste("AUC Unsupervised = ", round(auc.k25, 3), sep = ""), col = "dark green")
# plot for comparison (supervised AUC computed here rather than hard-coded)
auc.sup.k100 <- performance(supervised.pred.k100, measure = "auc")@y.values[[1]]
plot(supervised.perf.k100, col = "dark blue", main = "Supervised vs Unsupervised k = 100")
plot(unsupervised.perf.k100, add = TRUE, col = "dark green")
legend("topleft", legend = c("Supervised", "Unsupervised"),
col = c("dark blue", "dark green"), lty = 1, cex = 0.8)
abline(a=0, b=1)
text(x = .80, y = .27, paste("AUC Supervised = ", round(auc.sup.k100, 3), sep = ""), col = "dark blue")
text(x = .79, y = .21, paste("AUC Unsupervised = ", round(auc.k100, 3), sep = ""), col = "dark green")
Compare the performances of the supervised versus unsupervised classifiers and discuss the results. For example, recalling that the supervised method makes use of the class labels, whereas the unsupervised method doesn’t, what can you conclude considering there are applications where class labels are not available?
The first difference between the plots is that the AUC increases as k increases. As mentioned above, this is to be expected, and further analysis of the KNN Outlier logic would be needed to understand whether the model is becoming overfitted and what the most efficient value of k should be. As the AUC increases, the model's predictions improve, with the true positive rate also rising. Since class labels, and therefore supervised learning, are not always available, we can conclude that deeper analysis of the right value of k should be undertaken in the unsupervised setting. Looking solely at the plots, as k increases so do the AUC and the true positive rate, while the false positive rate changes little between the plots, consistent with few inliers being scored as outliers in these results.